---
title: "Jobs II Data Analysis: Job Training Efficiency for the Unemployed"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(ggplot2)
library(flexdashboard)
library(tidyverse)
library(GGally)
library(dplyr)
library(caret) #for logistic regression
library(broom)#for tidy() function
library(png)
library(imager)
library(graphics)
library(jpeg)
library(Hmisc)
```

```{r load_data}
#Read in Data 
df <- read_csv("C:/Users/Tates/Desktop/3200Project/JobsIICleanData.csv")
#Change data types for summary stats
fac_cols <- c("Marital_Status", "Employment", "White_Nonwhite",
              "Education", "Treatment", "Sex")
df[fac_cols] <- lapply(df[fac_cols], as.factor)
```

# Introduction {data-orientation="rows"}

## Row {data-height="250"}

### Overview

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC -- THE DATA MODELING FRAMEWORK

-   DEFINE the Problem
-   COLLECT the Data from Appropriate Sources
-   ORGANIZE the Data Collected
-   VISUALIZE the Data by Developing Charts
-   ANALYZE the data with Appropriate Statistical Methods
-   COMMUNICATE your Results

## Row {data-height="650"}

### The Problem & Data Collection

#### The Problem

Unemployment is a persistent issue in developed societies. Various organizations have created programs that give unemployed people resources to find employment. This analysis aims to evaluate the effectiveness of techniques used to combat unemployment and to analyze the demographic features of the unemployed population.

#### The Data

The Jobs II data come from a job-search intervention study investigating the efficacy of a job training intervention for the unemployed. The program is designed to increase reemployment among the unemployed and to enhance the mental health of job seekers. During the study, the subjects participated in job-skills workshops that taught skills for finding a new job and ways to handle setbacks in the employment process. The original data set contains 899 rows and 17 columns.

#### Data Sources

Vinokur, A. and Schul, Y. (1997). Mastery and inoculation against setbacks as active ingredients in the jobs intervention for the unemployed. Journal of Consulting and Clinical Psychology 65(5):867-77.

### The Data

VARIABLES TO PREDICT WITH

-   *Treatment*: Indicator variable for whether participant was randomly selected for the JOBS II training program. 1 = assignment to participation.
-   *Econ_Hardship*: Level of pre-treatment economic hardship determined with a pre-screening questionnaire. Continuous values from 1 to 5, 5 being the highest level of economic hardship.
-   *Age*: Age of participant during pre-screening questionnaire. Continuous values based on the day, month, and year.
-   *Job_Seek*: Measure of job-search self-efficacy shown by the participant during the study. Continuous values from 1 to 5, 5 being the highest level of self-efficacy.
-   *Marital_Status*: Marital status of the participant at the pre-screening questionnaire. 5 categories: married, never_married, separated, divorced, widowed.
-   *White_Nonwhite*: Whether the participant is white or of a different race. 2 categories: white, nonwhite.
-   *Education*: Level of previous education completed during pre-screening questionnaire. 5 categories: graduate_work, some_college, bachelors_degree, highschool_degree, some_highschool.
-   *Sex*: Sex of the participant. 2 categories: female, male.

VARIABLES WE WANT TO PREDICT

-   *Employment*: Whether the participant gained employment after the study, assessed in a follow-up interview. 2 categories: employed, unemployed.
-   *Depression*: Measure of depressive symptoms pre-treatment determined with a pre-screening questionnaire. Continuous values from 1 to 3, 3 being the highest level of depressive symptoms.

# Data

## Column {data-width="650"}

### Organize the Data

After organization, the clean data set contains 199 rows and 10 columns with no missing values. Some variables and rows were removed from the original data to increase readability and improve the efficiency of the predictive models.
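The cleaning itself is not shown in this document; a minimal base-R sketch of the kind of steps involved is below. The column names and values here are hypothetical, not the study's real raw file.

```r
# Hypothetical sketch of the organization step: drop unused columns and rows
# with missing values. Column names and values are illustrative only.
raw <- data.frame(
  Treatment  = c(1, 0, 1, 0),
  Age        = c(34.2, NA, 41.7, 29.1),
  Depression = c(1.8, 2.1, 1.5, 2.4),
  Unused_ID  = c(101, 102, 103, 104)   # a column we do not need
)
clean <- raw[, setdiff(names(raw), "Unused_ID")]  # remove unused variables
clean <- clean[complete.cases(clean), ]           # remove rows with NAs
dim(clean)  # 3 rows, 3 columns remain
```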

```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#View data
print(summary(df))
```

From this data we can see that our variables have a variety of different values based on their types. The summary statistics of the organized data set give insight on the participants of the study and the general unemployed population.

Observations Include:

1.  Over two-thirds of the participants were randomly assigned to the JOBS II training program; the remainder served as the comparison group.

2.  Participants reported moderate-to-high economic hardship (a mean of about 3 on the 1-to-5 scale).

3.  Most participants experienced depressive symptoms prior to the study.

4.  The average participant age was 38 years old.

5.  Most participants showed high job-search self-efficacy.

6.  The most common marital status of participants was married.

7.  Over three-fourths of the participants were white.

8.  The most common education level of participants was some college. Participants with a high school degree closely followed.

9.  There were similar numbers of male and female participants in the study.

10. Less than one-third of participants gained employment after the study.
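Several of these observations can be checked directly from the summary output; for example (counts copied from `summary(df)`):

```r
# Proportions behind observations 1 and 10, using counts from summary(df)
treated   <- 137; control    <- 62   # Treatment: 1 vs 0
employed  <- 61;  unemployed <- 138  # Employment at follow-up
prop_treated  <- treated / (treated + control)
prop_employed <- employed / (employed + unemployed)
round(prop_treated, 2)   # about 0.69 -- over two-thirds assigned to the program
round(prop_employed, 2)  # about 0.31 -- under one-third employed afterwards
```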

## Column {data-width="350"}

# Data Visualization: Average Depression by Education Level

## Column {data-width="650"}

How does the average level of depression vary by education level?

To determine whether education level could have a significant impact on depression, a chart displaying the average depression level for each education group was created.

```{r, cache=TRUE}
average_depression <- aggregate(Depression ~ Education, df, FUN = mean)

```

```{r, cache=TRUE}
barplot(average_depression$Depression, names.arg = average_depression$Education,
        xlab = "Education", ylab = "Average Depression Level",
        main = "Average Depression Level by Level of Education Completed",
        col = "blue", ylim = c(0, max(average_depression$Depression) * 1.2))

```
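The chart above uses base graphics; since ggplot2 is already loaded in the setup chunk, an equivalent version could be sketched as follows. The averages here are placeholder values, not the computed ones.

```r
library(ggplot2)
# Placeholder stand-in for the average_depression table computed above
avg <- data.frame(
  Education  = c("some_highschool", "highschool_degree", "some_college",
                 "bachelors_degree", "graduate_work"),
  Depression = c(2.0, 1.8, 1.8, 1.7, 1.9)
)
p <- ggplot(avg, aes(x = Education, y = Depression)) +
  geom_col(fill = "blue") +
  labs(title = "Average Depression Level by Level of Education Completed",
       x = "Education", y = "Average Depression Level")
p  # in the dashboard, replace avg with average_depression
```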

We can see that the average depression level is similar across all categories of education. The categories some_highschool and graduate_work had the highest average depression levels. This bar chart suggests that a participant's completed education level will not have a strong influence on the predicted depression level, although participants who completed some high school or graduate work may show slightly higher depression levels.

# Data Visualization: Distributions and Count

## Row {data-width="500"}

To understand which variables will be significant for predictive modeling, the distributions of the continuous variables and the counts of the categorical variables were analyzed.

## Row

### Continuous Predictor Variables

![](ContinuousDistributions.png)


We can see from the histograms that the distribution of age is fairly spread out, with most participants between 25 and 45 years of age. The distribution of economic hardship levels is concentrated in the middle of the scale. The distribution of job-search self-efficacy is concentrated near the top of the scale, telling us most participants reported high self-efficacy in their job search.

## Row

### Categorical Predictor Variables

![](CategoricalCounts.png)


We can see from the counts that sex and education level will most likely not have a significant impact on depression level, as their categories are fairly evenly distributed.

## Row

### Response Variables

![](Response%20Variables.png)


We can see that the distribution of the depression response variable is concentrated toward the lower end of the scale. The employment counts also show that only about one-third of participants gained employment, so participants were more likely to remain unemployed after the study.


# Depression Analysis {data-orientation="rows"}

## Row

### Predict Depression Level

For this analysis we will use a Linear Regression Model.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Depressionlm <- lm(Depression ~ . ,data = df)
summary(Depressionlm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Depressionlm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(Depressionlm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(Depressionlm)$sigma,2) #sigma is the residual standard error, reported here as the RMSE
valueBox(Sig, icon = "fa-thumbs-up")
```

## Row

### Regression Output

```{r,include=FALSE, cache=TRUE}
knitr::kable(summary(Depressionlm)$coef, digits = 3) #pretty table output
summary(Depressionlm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Depressionlm))[,4])  
out <- coef(summary(Depressionlm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(Depressionlm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

## Row

### Analysis Summary

After examining this model, we determine that there are some predictors that are not important in predicting the depression level, so a pruned version of the model is created by removing predictors that are not significant.
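Manual pruning like this can also be automated with AIC-based stepwise selection via `step()`. A self-contained toy illustration (not the study data) is below; on the dashboard data this would be `step(Depressionlm)`.

```r
# Toy illustration of step(): x1 is informative, x2 is pure noise, so stepwise
# selection by AIC should usually drop x2 and keep x1.
set.seed(1)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
toy$y <- 0.8 * toy$x1 + rnorm(80)
full   <- lm(y ~ x1 + x2, data = toy)
pruned <- step(full, trace = 0)  # trace = 0 suppresses the step-by-step log
names(coef(pruned))              # x1 is retained
```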

## Row

### Predict Depression Level: Final Version

For this analysis we will use a pruned linear regression model, with Treatment and Marital_Status removed.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Depressionlm2 <- lm(Depression ~ . -Treatment -Marital_Status,data = df)
summary(Depressionlm2)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Depressionlm2)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(Depressionlm2)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(Depressionlm2)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

## Row

### Regression Output

```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(Depressionlm2)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Depressionlm2))[,4])  
out <- coef(summary(Depressionlm2))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(Depressionlm2, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

## Row

### Analysis Summary

Removing the predictors that did not help with the prediction of depression levels improved the model: the adjusted R-squared increased, showing a better fit.

The model shows us that economic hardship, age, and sex were all significant variables for the prediction of depression levels.

From the following table, we can see the direction of each predictor variable's effect on the depression level.

```{r, cache=TRUE}
#create table summary of predictor effect directions
predchang <- data.frame(
  Variable = c('Econ_Hardship', 'Sex(male)', 'Age', 'White_Nonwhite(white)',
               'Education(some_college)', 'Employment(unemployed)',
               'Education(graduate_work)', 'Job_Seek',
               'Education(some_highschool)', 'Education(highschool_degree)'),
  Direction = c('Increase', 'Decrease', 'Decrease', 'Increase', 'Decrease',
                'Increase', 'Increase', 'Decrease', 'Increase', 'Decrease')
)
knitr::kable(predchang) #pretty table output

```

# Employment Analysis {data-width="500"}

## Row {data-width="500"}

### Predict Employment with a Neural Network Model

![](NeuralModel.png){width="304"} ![](NNDiagram.png){width="300"}

## Row

#### Analysis

The neural network model has a fairly high misclassification rate. With over one-fourth of the data misclassified for the validation and testing set, a neural network model is not the best fit for this set of data.
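The model above appears to have been fit in external software (only screenshots are embedded); for reference, a comparable single-hidden-layer network could be sketched in R with the nnet package. The data below are toy values, and the network size is an assumption, not the original settings.

```r
library(nnet)  # recommended package shipped with R
set.seed(42)
# Toy data standing in for the study variables
toy <- data.frame(
  Employment = factor(sample(c("employed", "unemployed"), 150, replace = TRUE)),
  Age        = runif(150, 18, 68),
  Job_Seek   = runif(150, 1, 5)
)
fit  <- nnet(Employment ~ Age + Job_Seek, data = toy,
             size = 3, trace = FALSE)        # 3 hidden units (assumed)
pred <- predict(fit, toy, type = "class")
mean(pred != toy$Employment)                 # training misclassification rate
```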

## Row {data-width="500"}

### Predict Employment with a Boosted Tree Model

![](BoostedTreeModel.png){width="335"}

![](BoostedTreeColumnContributors.png){width="345"}

## Row

#### Analysis

The boosted tree model has a similar misclassification rate but a higher r-squared value. Each variable is used in a significant number of splits, so the variables were not pruned for this model. The column-contribution table shows that sex, age, and job_seek were the most significant variables for prediction in this model. A boosted tree model is more effective for this analysis than a neural network because its r-squared is higher, indicating a higher degree of explanation.
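The misclassification comparison can be made concrete: given a 2x2 confusion matrix, the rate is one minus the diagonal share. The counts below are hypothetical placeholders, not the study's actual results.

```r
# Misclassification rate from a confusion matrix (rows = actual, cols = predicted)
misclass <- function(confusion) 1 - sum(diag(confusion)) / sum(confusion)

nn_conf <- matrix(c(30, 15, 10, 45), nrow = 2)  # hypothetical neural net counts
bt_conf <- matrix(c(34, 11,  8, 47), nrow = 2)  # hypothetical boosted tree counts
misclass(nn_conf)  # 0.25 -- about one-fourth misclassified
misclass(bt_conf)  # 0.19 -- lower rate for the boosted tree
```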